Device and method for clarifying dysarthria voices

ABSTRACT

A device and a method for clarifying dysarthria voices are disclosed. First, a dysarthria voice signal is received and framed to generate dysarthria frames. Then, dysarthria features are extracted from the dysarthria frames. Finally, the dysarthria features are converted into an intelligent voice signal based on an intelligent voice conversion model, without receiving phases corresponding to the dysarthria features. The intelligent voice conversion model is not trained by dynamic time warping (DTW). The present invention avoids phase distortion of the voice signal and provides more natural and clarified voices with low noise.

This application claims priority to Taiwan patent application No. 109129711, filed on 31 Aug. 2020, the content of which is incorporated by reference in its entirety.

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to technology for clarifying voices, particularly to a device and a method for clarifying dysarthria voices.

Description of the Related Art

Dysarthria patients are characterized by lisping because of abnormalities in muscle strength and timbre and a low speech rate while speaking. It is therefore hard for other people to understand a dysarthria patient's speech, which impacts the patient's quality of life. Dysarthria patients mostly include stroke patients, cerebral palsy patients, and Parkinson's disease patients. Although drugs for delaying neurodegeneration and rehabilitation treatments for improving muscle control have been developed, their effects vary from person to person and usually do not improve the condition substantially. Accordingly, researchers have proposed using voice conversion technology to convert the voices of patients into voices that normal people understand, thereby enhancing the clarity and comprehension of patients' voices.

The conventional voice conversion technology extracts the voice features of a dysarthria person and a target person, such as a fundamental frequency, a Mel spectrum, and an aperiodicity. It then derives conversion functions to convert the dysarthria features into target features. Finally, a vocoder synthesizes the converted features into voices. With dysarthria voices, it is technically difficult to extract the voice features and derive the conversion functions. Thus, an improved dysarthria voice conversion system uses log power spectrums and phases as features, wherein the log power spectrums and the phases are extracted using a Fourier transform. The log power spectrums are inputted to a pre-trained conversion model. The conversion model converts the log power spectrums into log power spectrums with enhanced comprehension without processing the phases. The converted log power spectrums and the phases are synthesized into voices with enhanced comprehension using an inverse Fourier transform. In the conventional technology, the implemented results clearly improve the comprehension of voices. However, the log power spectrums with enhanced comprehension produced by the conversion model have a mismatch with the unprocessed phases, which causes the synthesized voices to contain a lot of audible noise.

Referring to FIG. 1 and FIG. 2, the conventional technology uses a deep neural network (DNN)-based voice conversion method to improve dysarthria voices. FIG. 1 is a schematic diagram illustrating a device for clarifying dysarthria voices in the conventional technology. FIG. 2 is a schematic diagram illustrating a voice training system in the conventional technology. The device 1 for clarifying dysarthria voices includes a normalizer 10, a framing circuit 11, a short time Fourier transformer 12, a normalizer 13, a log power spectrum mapping deep neural network (LPS mapping DNN) 14, a denormalizer 15, an inverse fast Fourier transformer 16, and an interpolation circuit 17. The normalizer 10 normalizes a dysarthria voice signal D. The framing circuit 11 divides the dysarthria voice signal D into overlapping frames. Each frame has 256 sampling points and a time length of 16 ms. The short time Fourier transformer 12 extracts frequency-domain information from each frame, wherein the frequency-domain information includes magnitude M and phases Φ. The magnitude M is a log power spectrum. The pre-processed LPS mapping DNN 14 converts the magnitude M into reference features M′ of a normal person. The reference features M′ have better comprehension of voices. The inverse fast Fourier transformer 16 synthesizes the features M′ and the phases Φ to generate a voice signal in the time domain. Since the frames overlap each other, the interpolation circuit 17 interpolates the voice signal to improve the comprehension of the voice signal and generate an intelligent voice signal V. The LPS mapping DNN 14 is trained by a voice training system 2. The voice training system 2 includes a pre-processing circuit 20, a short time Fourier transformer 21, a normalizer 22, and a deep neural network (DNN) trainer 23.
The pre-processing circuit 20 uses dynamic time warping (DTW) to align dysarthria corpuses d to reference corpuses r of a normal person and frames the dysarthria corpuses d and the reference corpuses r to generate dysarthria frames and reference frames. Each frame has 256 sampling points and a time length of 16 ms. The short time Fourier transformer 21 respectively extracts dysarthria features Md and reference features Mr from the dysarthria frames and the reference frames. Each feature has 129 sampling points. The normalizer 22 normalizes the dysarthria features Md and the reference features Mr, such that the dysarthria features Md and the reference features Mr easily converge in training. The DNN trainer 23 trains the LPS mapping DNN 14 based on the dysarthria features Md and the reference features Mr. The LPS mapping DNN 14 learns how to convert the dysarthria features Md into the reference features Mr. However, the device 1 for clarifying dysarthria voices causes phase distortion. Presently, the LPS mapping DNN 14 has no better method to convert the phases Φ so that they match the reference features M′ for improving comprehension. The conventional technology uses the phases extracted by the short time Fourier transformer 12 or sets all the phases to zero. However, the results are poor. The intelligent voice signal V, which is synthesized by the inverse fast Fourier transformer 16 and the interpolation circuit 17 based on the phases Φ and the reference features M′ having a mismatch with the phases Φ, has obvious audible noise.

In the conventional technology, another device for clarifying dysarthria voices includes a framing circuit, a short time Fourier transformer, and a pre-processed Wave recurrent neural network (RNN). The framing circuit processes dysarthria voices to generate frames. The short time Fourier transformer extracts log power spectrums having 513 sampling points from each frame and uses the log power spectrums as dysarthria features.
The Wave RNN converts the dysarthria features into an intelligent voice signal. The Wave RNN is trained by a voice training system 3, as illustrated in FIG. 3. The voice training system 3 includes a pre-processing circuit 30, a short time Fourier transformer 31, and a voice trainer 32. The pre-processing circuit 30 receives the dysarthria corpuses d and the reference corpuses r of the normal person. The dysarthria corpuses d have 319 sentences. The reference corpuses r also have 319 sentences. The pre-processing circuit 30 uses DTW to align the dysarthria corpuses d to the reference corpuses r. The pre-processing circuit 30 respectively generates dysarthria frames Xd and reference frames Xr based on the dysarthria corpuses d and the reference corpuses r. The short time Fourier transformer 31 extracts log power spectrums from the dysarthria frames Xd as the dysarthria features Md. The voice trainer 32 trains a Wave RNN based on the dysarthria features Md and the reference frames Xr. Although the Wave RNN converts the dysarthria features into the intelligent voice signal to avoid phase distortion, the requirements for aligning the dysarthria features Md with the reference frames Xr are very strict. In general, DTW works well for aligning paired voice signals of normal people, but it is not ideal for aligning the dysarthria corpuses d to the reference corpuses r. As a result, the Wave RNN cannot be trained directly based on the dysarthria features Md and the reference frames Xr.
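The frame and feature sizes quoted above are mutually consistent: 256 sampling points over 16 ms implies a 16 kHz sampling rate, and a one-sided spectrum of a 256-point frame has 256/2 + 1 = 129 bins (likewise 513 bins would correspond to a 1024-point transform). The following is a minimal sketch of framing and log power spectrum extraction under those assumptions. The 50% hop, the test tone, and the function names are illustrative and not taken from the source; a naive DFT is used only to keep the sketch dependency-free.

```python
import cmath
import math

def frame_signal(samples, frame_len=256, hop=128):
    """Split samples into overlapping frames (hop of 128 is an assumed 50% overlap)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_power_spectrum(frame):
    """Return the one-sided log power spectrum and phase of one frame (naive DFT)."""
    n = len(frame)
    bins = n // 2 + 1  # 129 bins for a 256-point frame
    lps, phase = [], []
    for k in range(bins):
        x = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        lps.append(math.log(abs(x) ** 2 + 1e-12))  # small floor avoids log(0)
        phase.append(cmath.phase(x))
    return lps, phase

sample_rate = 16000  # assumed: 256 samples / 0.016 s
# Illustrative test tone standing in for a voice signal.
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(1024)]
frames = frame_signal(signal)
lps, phase = log_power_spectrum(frames[0])
print(len(frames), len(frames[0]), len(lps))
```

The phase list returned here is exactly the quantity that the conventional device carries unchanged to the inverse transform, which is where the mismatch with the converted magnitudes arises.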

To overcome the abovementioned problems of the prior art, the present invention provides a device and a method for clarifying dysarthria voices.

SUMMARY OF THE INVENTION

The present invention provides a device and a method for clarifying dysarthria voices, which avoid the phase distortion of the voice signal and provide more natural and clarified voices with low noise.

In an embodiment of the present invention, a device for clarifying dysarthria voices includes a first framing circuit, a first feature extracter, and an intelligent voice converter. The first framing circuit is configured to receive and frame a dysarthria voice signal to generate dysarthria frames. The first feature extracter is coupled to the first framing circuit and configured to receive the dysarthria frames and extract dysarthria features from the dysarthria frames. The intelligent voice converter is coupled to the first feature extracter and configured to receive the dysarthria features. The intelligent voice converter is configured to convert the dysarthria features into an intelligent voice signal based on an intelligent voice conversion model without receiving phases corresponding to the dysarthria features. The intelligent voice conversion model is not trained based on dynamic time warping (DTW).

In an embodiment of the present invention, the intelligent voice conversion model is trained by an intelligent voice training system. The intelligent voice training system includes a second framing circuit, a second feature extracter, a feature mapper, a voice synthesizer, and an intelligent voice trainer. The second framing circuit is configured to receive and frame a dysarthria corpus corresponding to the dysarthria voice signal to generate dysarthria corpus frames. The second feature extracter is coupled to the second framing circuit and configured to receive the dysarthria corpus frames and extract from the dysarthria corpus frames dysarthria corpus features corresponding to the dysarthria features. The feature mapper is coupled to the second feature extracter and configured to receive the dysarthria corpus features and convert the dysarthria corpus features into reference corpus features corresponding to the intelligent voice signal based on a feature mapping model. The voice synthesizer is coupled to the feature mapper and configured to receive the reference corpus features and convert the reference corpus features into reference corpus frames based on a voice synthesizing model. The intelligent voice trainer is coupled to the second feature extracter and the voice synthesizer and configured to receive the reference corpus frames and the dysarthria corpus features and train the intelligent voice conversion model based on the reference corpus frames and the dysarthria corpus features.

In an embodiment of the present invention, the feature mapping model is trained by a feature mapping training system. The feature mapping training system includes a corpus pre-processing circuit, a mapping feature extracter, and a feature mapping trainer. The corpus pre-processing circuit is configured to receive, frame, and align the dysarthria corpus and a reference corpus to generate the dysarthria corpus frames and the reference corpus frames. The dysarthria corpus frames and the reference corpus frames are aligned to each other. The reference corpus corresponds to the intelligent voice signal. The mapping feature extracter is coupled to the corpus pre-processing circuit and configured to receive the dysarthria corpus frames and the reference corpus frames and respectively extract the dysarthria corpus features and the reference corpus features from the dysarthria corpus frames and the reference corpus frames. The feature mapping trainer is coupled to the mapping feature extracter and configured to receive the dysarthria corpus features and the reference corpus features and train the feature mapping model based on the dysarthria corpus features and the reference corpus features.

In an embodiment of the present invention, the voice synthesizing model is trained by a voice synthesizing training system. The voice synthesizing training system includes a third framing circuit, a third feature extracter, and a voice synthesizing trainer. The third framing circuit is configured to receive and frame a reference corpus to generate the reference corpus frames. The reference corpus corresponds to the intelligent voice signal. The third feature extracter is coupled to the third framing circuit and configured to receive the reference corpus frames and extract the reference corpus features from the reference corpus frames. The voice synthesizing trainer is coupled to the third framing circuit and the third feature extracter and configured to receive the reference corpus frames and the reference corpus features and train the voice synthesizing model based on the reference corpus frames and the reference corpus features.

In an embodiment of the present invention, the intelligent voice conversion model includes a feature mapping model and a voice synthesizing model. The intelligent voice converter includes a feature mapper and a voice synthesizer. The feature mapper is coupled to the first feature extracter and configured to receive the dysarthria features and convert the dysarthria features into reference features based on the feature mapping model. The voice synthesizer is coupled to the feature mapper and configured to receive the reference features and convert the reference features into the intelligent voice signal based on the voice synthesizing model.

In an embodiment of the present invention, the feature mapping model is trained by a feature mapping training system. The feature mapping training system includes a corpus pre-processing circuit, a mapping feature extracter, and a feature mapping trainer. The corpus pre-processing circuit is configured to receive, frame, and align a dysarthria corpus and a reference corpus to generate dysarthria corpus frames and reference corpus frames that are aligned to each other. The dysarthria corpus corresponds to the dysarthria voice signal. The reference corpus corresponds to the intelligent voice signal. The mapping feature extracter is coupled to the corpus pre-processing circuit and configured to receive the dysarthria corpus frames and the reference corpus frames and respectively extract dysarthria corpus features and reference corpus features from the dysarthria corpus frames and the reference corpus frames. The dysarthria corpus features and the reference corpus features respectively correspond to the dysarthria features and the reference features. The feature mapping trainer is coupled to the mapping feature extracter and configured to receive the dysarthria corpus features and the reference corpus features and train the feature mapping model based on the dysarthria corpus features and the reference corpus features.

In an embodiment of the present invention, the voice synthesizing model is trained by a voice synthesizing training system. The voice synthesizing training system includes a second framing circuit, a second feature extracter, and a voice synthesizing trainer. The second framing circuit is configured to receive and frame a reference corpus to generate reference corpus frames. The reference corpus corresponds to the intelligent voice signal. The second feature extracter is coupled to the second framing circuit and configured to receive the reference corpus frames and extract reference corpus features corresponding to the reference features from the reference corpus frames. The voice synthesizing trainer is coupled to the second framing circuit and the second feature extracter and configured to receive the reference corpus frames and the reference corpus features and train the voice synthesizing model based on the reference corpus frames and the reference corpus features.

In an embodiment of the present invention, the dysarthria features comprise at least one of a log power spectrum (LPS), a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity. The intelligent voice conversion model includes a WaveNet or a Wave recurrent neural network (RNN).

In an embodiment of the present invention, the dysarthria features comprise log power spectrums. The intelligent voice converter is configured to convert the dysarthria features into the intelligent voice signal using an inverse Fourier transform.

In an embodiment of the present invention, the dysarthria features comprise a log power spectrum (LPS), a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity. The intelligent voice converter is a vocoder.

In an embodiment of the present invention, a method for clarifying dysarthria voices includes: receiving and framing a dysarthria voice signal to generate dysarthria frames; receiving the dysarthria frames and extracting dysarthria features from the dysarthria frames; and receiving the dysarthria features and converting the dysarthria features into an intelligent voice signal based on an intelligent voice conversion model without receiving phases corresponding to the dysarthria features; wherein the intelligent voice conversion model is not trained based on dynamic time warping (DTW).

In an embodiment of the present invention, a method for training the intelligent voice conversion model includes: receiving and framing a dysarthria corpus corresponding to the dysarthria voice signal to generate dysarthria corpus frames; receiving the dysarthria corpus frames and extracting from the dysarthria corpus frames dysarthria corpus features corresponding to the dysarthria features; receiving the dysarthria corpus features and converting the dysarthria corpus features into reference corpus features corresponding to the intelligent voice signal based on a feature mapping model; receiving the reference corpus features and converting the reference corpus features into reference corpus frames based on a voice synthesizing model; and receiving the reference corpus frames and the dysarthria corpus features and training the intelligent voice conversion model based on the reference corpus frames and the dysarthria corpus features.

In an embodiment of the present invention, a method for training the feature mapping model includes: receiving, framing, and aligning the dysarthria corpus and a reference corpus to generate the dysarthria corpus frames and the reference corpus frames, wherein the dysarthria corpus frames and the reference corpus frames are aligned to each other, and the reference corpus corresponds to the intelligent voice signal; receiving the dysarthria corpus frames and the reference corpus frames and respectively extracting the dysarthria corpus features and the reference corpus features from the dysarthria corpus frames and the reference corpus frames; and receiving the dysarthria corpus features and the reference corpus features and training the feature mapping model based on the dysarthria corpus features and the reference corpus features.

In an embodiment of the present invention, a method for training the voice synthesizing model includes: receiving and framing a reference corpus to generate the reference corpus frames, wherein the reference corpus corresponds to the intelligent voice signal; receiving the reference corpus frames and extracting the reference corpus features from the reference corpus frames; and receiving the reference corpus frames and the reference corpus features and training the voice synthesizing model based on the reference corpus frames and the reference corpus features.

In an embodiment of the present invention, the intelligent voice conversion model includes a feature mapping model and a voice synthesizing model. The step of receiving the dysarthria features and converting the dysarthria features into the intelligent voice signal based on the intelligent voice conversion model without receiving the phases includes: receiving the dysarthria features and converting the dysarthria features into reference features based on the feature mapping model; and receiving the reference features and converting the reference features into the intelligent voice signal based on the voice synthesizing model.

In an embodiment of the present invention, a method for training the feature mapping model includes: receiving, framing, and aligning a dysarthria corpus and a reference corpus to generate dysarthria corpus frames and reference corpus frames that are aligned to each other, wherein the dysarthria corpus corresponds to the dysarthria voice signal, and the reference corpus corresponds to the intelligent voice signal; receiving the dysarthria corpus frames and the reference corpus frames and respectively extracting dysarthria corpus features and reference corpus features from the dysarthria corpus frames and the reference corpus frames, wherein the dysarthria corpus features and the reference corpus features respectively correspond to the dysarthria features and the reference features; and receiving the dysarthria corpus features and the reference corpus features and training the feature mapping model based on the dysarthria corpus features and the reference corpus features.

In an embodiment of the present invention, a method for training the voice synthesizing model includes: receiving and framing a reference corpus to generate reference corpus frames, wherein the reference corpus corresponds to the intelligent voice signal; receiving the reference corpus frames and extracting reference corpus features corresponding to the reference corpus from the reference corpus frames; and receiving the reference corpus frames and the reference corpus features and training the voice synthesizing model based on the reference corpus frames and the reference corpus features.

In an embodiment of the present invention, the dysarthria features include at least one of a log power spectrum (LPS), a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity. The intelligent voice conversion model includes a WaveNet or a Wave recurrent neural network (RNN).

To sum up, the device and the method for clarifying dysarthria voices convert dysarthria features into an intelligent voice signal based on an intelligent voice conversion model without using an inverse Fourier transform or receiving phases corresponding to the dysarthria features.

Below, the embodiments are described in detail with reference to the drawings so that the technical contents, characteristics, and accomplishments of the present invention are easily understood.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

FIG. 1 is a schematic diagram illustrating a device for clarifying dysarthria voices in the conventional technology;

FIG. 2 is a schematic diagram illustrating a voice training system in the conventional technology;

FIG. 3 is a schematic diagram illustrating another voice training system in the conventional technology;

FIG. 4 is a schematic diagram illustrating a device for clarifying dysarthria voices according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating an intelligent voice training system according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating a feature mapping training system according to an embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating a voice synthesizing training system according to an embodiment of the present invention;

FIG. 8 is a schematic diagram illustrating an intelligent voice converter according to an embodiment of the present invention;

FIG. 9 is a schematic diagram illustrating another feature mapping training system according to an embodiment of the present invention;

FIG. 10 is a schematic diagram illustrating another voice synthesizing training system according to an embodiment of the present invention;

FIG. 11 is a schematic diagram illustrating another intelligent voice training system according to an embodiment of the present invention;

FIG. 12 is a diagram illustrating a waveform of a dysarthria voice signal according to an embodiment of the present invention;

FIG. 13 is a diagram illustrating a waveform of a reference voice signal of a normal person according to an embodiment of the present invention;

FIG. 14 is a diagram illustrating a waveform of an intelligent voice signal corresponding to a first implementation;

FIG. 15 is a diagram illustrating a waveform of an intelligent voice signal corresponding to a second implementation;

FIG. 16 is a diagram illustrating a waveform of an intelligent voice signal corresponding to a third implementation;

FIG. 17 is a diagram illustrating a waveform of an intelligent voice signal corresponding to a fourth implementation;

FIG. 18 is a diagram illustrating a spectrum of a dysarthria voice signal according to an embodiment of the present invention;

FIG. 19 is a diagram illustrating a spectrum of a reference voice signal of a normal person according to an embodiment of the present invention;

FIG. 20 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the first implementation;

FIG. 21 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the second implementation;

FIG. 22 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the third implementation; and

FIG. 23 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the fourth implementation.

DETAILED DESCRIPTION OF THE INVENTION

Reference will now be made in detail to embodiments illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts. In the drawings, the shape and thickness may be exaggerated for clarity and convenience. This description will be directed in particular to elements forming part of, or cooperating more directly with, methods and apparatus in accordance with the present disclosure. It is to be understood that elements not specifically shown or described may take various forms well known to those skilled in the art. Many alternatives and modifications will be apparent to those skilled in the art, once informed by the present disclosure.

Unless otherwise specified, conditional words such as “can”, “could”, “might”, or “may” generally indicate that an embodiment of the present invention has a given feature, element, or step, but they may also be interpreted to mean that the feature, element, or step is not needed. In other embodiments, these features, elements, or steps may not be required.

Certain terms are used throughout the description and the claims to refer to particular components. One skilled in the art appreciates that a component may be referred to by different names. This disclosure does not intend to distinguish between components that differ in name but not in function. In the description and in the claims, the term “comprise” is used in an open-ended fashion, and thus should be interpreted to mean “include, but not limited to.” The phrases “be coupled to,” “couples to,” and “coupling to” are intended to encompass any indirect or direct connection. Accordingly, if this disclosure mentions that a first device is coupled with a second device, it means that the first device may be directly or indirectly connected to the second device through electrical connections, wireless communications, optical communications, or other signal connections with or without other intermediate devices or connection means.

Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment.

FIG. 4 is a schematic diagram illustrating a device for clarifying dysarthria voices according to an embodiment of the present invention. Referring to FIG. 4, an embodiment of the device for clarifying dysarthria voices 4 is introduced as follows. The device for clarifying dysarthria voices 4 does not use an inverse fast Fourier transformer or an interpolation circuit, so as to avoid phase distortion and provide more natural and clarified voices with low noise. The device for clarifying dysarthria voices 4 includes a first framing circuit 41, a first feature extracter 42, and an intelligent voice converter 43. The first feature extracter 42 is coupled to the first framing circuit 41 and the intelligent voice converter 43. The first framing circuit 41 receives and frames a dysarthria voice signal D to generate dysarthria frames FRD. The first feature extracter 42 receives the dysarthria frames FRD and extracts dysarthria features FD from the dysarthria frames FRD. The intelligent voice converter 43 receives the dysarthria features FD. The intelligent voice converter 43 converts the dysarthria features FD into an intelligent voice signal V of normal persons based on an intelligent voice conversion model without receiving phases corresponding to the dysarthria features FD. In addition, the intelligent voice conversion model is not trained based on dynamic time warping (DTW). The dysarthria features FD include at least one of a log power spectrum (LPS), a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity. The intelligent voice conversion model may be implemented with a neural network, such as a WaveNet or a Wave recurrent neural network (RNN). For example, when the dysarthria features FD include log power spectrums, the intelligent voice converter 43 is implemented with an inverse Fourier transformer that transforms the dysarthria features FD into the intelligent voice signal V.
When the dysarthria features FD include a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity, the intelligent voice converter 43 is implemented with a vocoder.
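The data flow of FIG. 4 can be sketched as a three-stage pipeline in which only features, and never phases, reach the converter. This is a structural illustration only: the per-frame log-energy feature, the hop size, and `toy_conversion_model` are hypothetical stand-ins, and a real converter would be a trained model such as a WaveNet or Wave RNN.

```python
import math

def frame(signal, frame_len=256, hop=128):
    """First framing circuit 41: dysarthria voice signal D -> dysarthria frames FRD."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def extract_features(frames):
    """First feature extracter 42: frames FRD -> features FD.
    Stand-in feature: one log-energy value per frame (assumed for brevity)."""
    return [[math.log(sum(s * s for s in f) + 1e-12)] for f in frames]

def toy_conversion_model(features, samples_per_frame=128):
    """Hypothetical placeholder for the trained conversion model:
    emits silence with one hop of samples per feature vector."""
    return [0.0] * (len(features) * samples_per_frame)

def clarify(signal, conversion_model):
    """Intelligent voice converter 43: the model is given features FD only."""
    frd = frame(signal)
    fd = extract_features(frd)
    return conversion_model(fd)  # no phase information is passed

dysarthria_signal = [math.sin(0.05 * t) for t in range(1024)]
voice = clarify(dysarthria_signal, toy_conversion_model)
print(len(voice))
```

Because the converter's signature accepts features alone, there is no path by which unprocessed phases could be recombined with converted magnitudes.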

The intelligent voice conversion model is trained by an intelligent voice training system. FIG. 5 is a schematic diagram illustrating an intelligent voice training system according to an embodiment of the present invention. Referring to FIG. 4 and FIG. 5, the intelligent voice training system 5 may include a second framing circuit 51, a second feature extracter 52, a feature mapper 53, a voice synthesizer 54, and an intelligent voice trainer 55. The second feature extracter 52 is coupled to the second framing circuit 51. The feature mapper 53 is coupled to the second feature extracter 52. The voice synthesizer 54 is coupled to the feature mapper 53. The intelligent voice trainer 55 is coupled to the second feature extracter 52 and the voice synthesizer 54. The second framing circuit 51 receives and frames a dysarthria corpus d corresponding to the dysarthria voice signal D to generate dysarthria corpus frames frd. The second feature extracter 52 receives the dysarthria corpus frames frd and extracts from the dysarthria corpus frames frd dysarthria corpus features fd corresponding to the dysarthria features FD. The feature mapper 53 receives the dysarthria corpus features fd and converts the dysarthria corpus features fd into reference corpus features fr corresponding to the intelligent voice signal V based on a feature mapping model. The voice synthesizer 54 receives the reference corpus features fr and converts the reference corpus features fr into reference corpus frames frr based on a voice synthesizing model. The intelligent voice trainer 55 receives the reference corpus frames frr and the dysarthria corpus features fd and trains the intelligent voice conversion model based on the reference corpus frames frr and the dysarthria corpus features fd. The feature mapping model and the voice synthesizing model are implemented with WaveNets or Wave recurrent neural networks (RNNs).
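The training arrangement of FIG. 5 can be read as generating the training targets from the dysarthria corpus itself: each reference corpus frame frr is synthesized from a feature vector that was mapped from a dysarthria corpus feature fd, so the pairs given to the trainer correspond frame by frame by construction. The sketch below illustrates that pairing under stated assumptions. `toy_mapper` and `toy_synthesizer` are hypothetical placeholders for the pre-trained feature mapping model and voice synthesizing model, and the one-to-one frame correspondence is an assumption made for illustration.

```python
def build_training_pairs(fd_utterances, feature_mapper, voice_synthesizer):
    """Pair dysarthria corpus features fd with synthesized reference frames frr.

    fd_utterances: list of utterances, each a list of per-frame feature vectors.
    Returns a list of (fd, frr) pairs for the intelligent voice trainer 55.
    """
    pairs = []
    for fd in fd_utterances:
        fr = feature_mapper(fd)       # feature mapper 53: fd -> fr
        frr = voice_synthesizer(fr)   # voice synthesizer 54: fr -> frr
        if len(frr) != len(fd):
            raise ValueError("expected one reference frame per feature vector")
        pairs.append((fd, frr))
    return pairs

def toy_mapper(fd):
    """Hypothetical stand-in for the feature mapping model."""
    return [[2.0 * v for v in vec] for vec in fd]

def toy_synthesizer(fr, frame_len=4):
    """Hypothetical stand-in for the voice synthesizing model."""
    return [[vec[0]] * frame_len for vec in fr]

corpus_features = [[[1.0], [2.0], [3.0]], [[5.0], [4.0]]]
pairs = build_training_pairs(corpus_features, toy_mapper, toy_synthesizer)
print(len(pairs), len(pairs[0][1]))
```

Under this reading, no alignment step is applied between the trainer's inputs, which is consistent with the statement that the intelligent voice conversion model is not trained based on DTW.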

The feature mapping model is trained by a feature mapping training system. FIG. 6 is a schematic diagram illustrating a feature mapping training system according to an embodiment of the present invention. Referring to FIG. 4 and FIG. 6, the feature mapping training system 6 may include a corpus pre-processing circuit 61, a mapping feature extracter 62, and a feature mapping trainer 63. The mapping feature extracter 62 may be a short time Fourier transformer (STFT), but the present invention is not limited thereto. The mapping feature extracter 62 is coupled to the corpus pre-processing circuit 61 and the feature mapping trainer 63. The corpus pre-processing circuit 61 receives, frames, and aligns the dysarthria corpus d and a reference corpus r to generate the dysarthria corpus frames frd and the reference corpus frames frr. The dysarthria corpus frames frd and the reference corpus frames frr are aligned to each other. The reference corpus r corresponds to the intelligent voice signal V. The mapping feature extracter 62 receives the dysarthria corpus frames frd and the reference corpus frames frr and respectively extracts the dysarthria corpus features fd and the reference corpus features fr from the dysarthria corpus frames frd and the reference corpus frames frr. The feature mapping trainer 63 receives the dysarthria corpus features fd and the reference corpus features fr and trains the feature mapping model based on the dysarthria corpus features fd and the reference corpus features fr.

The voice synthesizing model is trained by a voice synthesizing training system. FIG. 7 is a schematic diagram illustrating a voice synthesizing training system according to an embodiment of the present invention. Referring to FIG. 4 and FIG. 7, the voice synthesizing training system 7 may include a third framing circuit 71, a third feature extracter 72, and a voice synthesizing trainer 73. The third framing circuit 71 is coupled to the third feature extracter 72 and the voice synthesizing trainer 73. The third feature extracter 72 is coupled to the voice synthesizing trainer 73. The third framing circuit 71 receives and frames a reference corpus r to generate the reference corpus frames frr. The reference corpus r corresponds to the intelligent voice signal V. The third feature extracter 72 receives the reference corpus frames frr and extracts the reference corpus features fr from the reference corpus frames frr. The voice synthesizing trainer 73 receives the reference corpus frames frr and the reference corpus features fr and trains the voice synthesizing model based on the reference corpus frames frr and the reference corpus features fr. Since the reference corpus frames frr and the reference corpus features fr both come from the reference corpus r, the reference corpus frames frr and the reference corpus features fr are automatically aligned to each other.

FIG. 8 is a schematic diagram illustrating an intelligent voice converter according to an embodiment of the present invention. Referring to FIG. 4 and FIG. 8, the intelligent voice converter 43 is introduced as follows. In some embodiments of the present invention, the intelligent voice converter 43 may include a feature mapper 431 and a voice synthesizer 432. The intelligent voice conversion model may include a feature mapping model and a voice synthesizing model. The feature mapping model and the voice synthesizing model are implemented with neural networks, such as WaveNets or Wave recurrent neural networks (RNNs). The feature mapper 431 is coupled to the first feature extracter 42 and the voice synthesizer 432. The feature mapper 431 receives the dysarthria features FD and converts the dysarthria features FD into reference features FR of normal persons based on the feature mapping model. The voice synthesizer 432 receives the reference features FR and converts the reference features FR into the intelligent voice signal V based on the voice synthesizing model.

FIG. 9 is a schematic diagram illustrating another feature mapping training system according to an embodiment of the present invention. The feature mapping model of FIG. 8 is trained by a feature mapping training system. Referring to FIG. 8 and FIG. 9, the feature mapping training system 8 may include a corpus pre-processing circuit 81, a mapping feature extracter 82, and a feature mapping trainer 83. The mapping feature extracter 82 may be implemented with a short time Fourier transformer (STFT), but the present invention is not limited thereto. The mapping feature extracter 82 is coupled to the corpus pre-processing circuit 81 and the feature mapping trainer 83. The corpus pre-processing circuit 81 receives, frames, and aligns a dysarthria corpus d and a reference corpus r to generate dysarthria corpus frames frd and reference corpus frames frr that are aligned to each other. The dysarthria corpus d corresponds to the dysarthria voice signal D. The reference corpus r corresponds to the intelligent voice signal V. The mapping feature extracter 82 receives the dysarthria corpus frames frd and the reference corpus frames frr and respectively extracts dysarthria corpus features fd and reference corpus features fr from the dysarthria corpus frames frd and the reference corpus frames frr. The dysarthria corpus features fd and the reference corpus features fr respectively correspond to the dysarthria features FD and the reference features FR. The feature mapping trainer 83 receives the dysarthria corpus features fd and the reference corpus features fr and trains the feature mapping model based on the dysarthria corpus features fd and the reference corpus features fr.

FIG. 10 is a schematic diagram illustrating another voice synthesizing training system according to an embodiment of the present invention. The voice synthesizing model of FIG. 8 is trained by a voice synthesizing training system. Referring to FIG. 8 and FIG. 10, the voice synthesizing training system 9 is independent of the dysarthria voice signal D. The voice synthesizing training system 9 may include a second framing circuit 91, a second feature extracter 92, and a voice synthesizing trainer 93. The second feature extracter 92 is coupled to the second framing circuit 91 and the voice synthesizing trainer 93. The second framing circuit 91 receives and frames a reference corpus r to generate reference corpus frames frr. The reference corpus r corresponds to the intelligent voice signal V. The second feature extracter 92 receives the reference corpus frames frr and extracts reference corpus features fr corresponding to the reference features FR from the reference corpus frames frr. The voice synthesizing trainer 93 receives the reference corpus frames frr and the reference corpus features fr and trains the voice synthesizing model based on the reference corpus frames frr and the reference corpus features fr.

The performance of four implementations is compared as follows. In this example, the dysarthria features are implemented with a log power spectrum. The dysarthria voice signal is sampled at a rate of 16,000 sample points per second and framed. Each dysarthria frame includes 1024 sample points (64 ms), a length that is an integer power of 2 so that a fast Fourier transform can be used. In order to prevent information from varying too much between neighboring dysarthria frames, the dysarthria frames overlap each other, and the hop size of the dysarthria frames is set to 256 sample points. The STFT extracts a log power spectrum with 513 points (1024/2 + 1 frequency bins) as the dysarthria features from each dysarthria frame with 1024 sample points.
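The framing and feature extraction settings above can be reproduced with a short NumPy sketch. The sampling rate, frame length, and hop size are taken from the text; the Hann window, the small constant eps, the absence of padding, and the function names frame_signal and log_power_spectrum are assumptions made only for illustration. The sketch keeps the magnitude information and discards the phase, which is the property the log power spectrum features rely on.

```python
import numpy as np

SAMPLE_RATE = 16000  # sample points per second (from the text)
FRAME_LEN = 1024     # sample points per frame, i.e. 64 ms at 16 kHz
HOP = 256            # hop size in sample points, i.e. 16 ms at 16 kHz

def frame_signal(signal, frame_len=FRAME_LEN, hop=HOP):
    """Split a 1-D signal into overlapping frames (no padding: an assumption)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_power_spectrum(frames, eps=1e-10):
    """Return the log power spectrum of each frame; the phase is discarded."""
    window = np.hanning(frames.shape[1])        # window choice is an assumption
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.log(np.abs(spectrum) ** 2 + eps)  # eps avoids log(0)

# One second of random noise stands in for a dysarthria voice signal.
signal = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
frames = frame_signal(signal)
lps = log_power_spectrum(frames)
```

For a real-valued frame of 1024 sample points, np.fft.rfft returns 1024/2 + 1 = 513 values, which matches the 513-point log power spectrum described above.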

The first implementation is implemented with the device 1 for clarifying dysarthria voices of FIG. 1, wherein the dysarthria features include a log power spectrum with 513 points. The LPS mapping DNN 14 is implemented with a fully-connected deep neural network (FCDNN), which includes an input layer with 513 nodes, three hidden layers with 1024 nodes each, and an output layer with 513 nodes. The FCDNN is trained by the dysarthria corpus d and the reference corpus r, wherein the dysarthria corpus d and the reference corpus r are aligned to each other by DTW. Each frame of the training data has a time length of 64 ms, and the frames overlap each other, wherein the hop size of the frames is 16 ms (256 sample points at 16,000 sample points per second).
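For readers unfamiliar with DTW, the following is a textbook dynamic time warping sketch in Python. It is not the alignment code used in the first implementation, whose details the text does not give; the Euclidean frame distance, the three allowed steps, and the function name dtw_path are assumptions. The function returns index pairs (i, j) that match frame i of one feature sequence to frame j of the other.

```python
import numpy as np

def dtw_path(src, ref):
    """Align two feature sequences of shapes (n, d) and (m, d) with DTW."""
    n, m = len(src), len(ref)
    # Pairwise Euclidean distance between every source and reference frame.
    dist = np.linalg.norm(src[:, None, :] - ref[None, :, :], axis=2)
    # cost[i, j] is the cheapest cumulative cost of aligning the first i
    # source frames with the first j reference frames.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end of both sequences to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy one-dimensional features: identical sequences align along the diagonal.
seq = np.array([[0.0], [1.0], [2.0]])
diagonal = dtw_path(seq, seq)
```

Because the path may repeat an index on either side, frames of a slow utterance can be matched to fewer frames of a faster one. The quality of the resulting frame pairs depends on how well this matching works.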

In the second implementation, the fast Fourier transformer 16 and the interpolation circuit 17 of the device 1 for clarifying dysarthria voices of FIG. 1 are replaced with the voice synthesizer 432 of FIG. 8. The dysarthria features include a log power spectrum with 513 points. The LPS mapping DNN 14 is implemented with a 513×1024×1024×1024×513 fully-connected deep neural network (FCDNN). The voice synthesizer 432 is implemented with a WaveRNN. The WaveRNN is trained by the reference corpus features and the reference corpus frames that are aligned to the reference corpus features. The reference corpus features and the reference corpus frames are generated from a reference corpus of 319 sentences. The reference corpus frames overlap each other. Each reference corpus frame has a time length of 64 ms. The hop size of the reference corpus frames is 256 sample points.
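The 513×1024×1024×1024×513 layout of the FCDNN can be sketched as a plain NumPy forward pass. Only the layer sizes come from the text; the ReLU activation, the linear output layer, the random initialization, and the names init_params and forward are assumptions, and no training is shown.

```python
import numpy as np

LAYER_SIZES = [513, 1024, 1024, 1024, 513]  # layer sizes from the text

def init_params(sizes, seed=0):
    """Create one (weights, bias) pair per layer; random init is an assumption."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, lps):
    """Map log power spectra of shape (n_frames, 513) to the same shape."""
    h = lps
    for k, (w, b) in enumerate(params):
        h = h @ w + b
        if k < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers (assumed activation)
    return h

params = init_params(LAYER_SIZES)
batch = np.zeros((4, 513))          # four placeholder feature vectors
mapped = forward(params, batch)
n_weights = sum(w.size + b.size for w, b in params)
```

The network maps each 513-point dysarthria feature vector to a 513-point reference feature vector, one frame at a time, so the number of frames is unchanged by the mapping.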

The third implementation is implemented with a device for clarifying dysarthria voices that includes a framing circuit, a short time Fourier transformer, and a pre-trained WaveRNN. The framing circuit processes dysarthria voices to generate frames. The short time Fourier transformer extracts from each frame a log power spectrum with 513 points as dysarthria features. The WaveRNN converts the dysarthria features into an intelligent voice signal. The WaveRNN is trained by the voice training system 3 of FIG. 3.

The fourth implementation is implemented with the device 4 for clarifying dysarthria voices of FIG. 4. The intelligent voice conversion model is trained by the intelligent voice training system 5 of FIG. 11. The second feature extracter 52 is implemented with a short time Fourier transformer 521. The feature mapper 53 is implemented with a log power spectrum mapping deep neural network (LPS mapping DNN) 531. The voice synthesizer 54 is implemented with an inverse fast Fourier transformer 541 and an interpolation circuit 542 that are coupled to each other.

FIG. 12 is a diagram illustrating a waveform of a dysarthria voice signal according to an embodiment of the present invention. FIG. 13 is a diagram illustrating a waveform of a reference voice signal of a normal person according to an embodiment of the present invention. FIG. 14 is a diagram illustrating a waveform of an intelligent voice signal corresponding to the first implementation. FIG. 15 is a diagram illustrating a waveform of an intelligent voice signal corresponding to the second implementation. FIG. 16 is a diagram illustrating a waveform of an intelligent voice signal corresponding to the third implementation. FIG. 17 is a diagram illustrating a waveform of an intelligent voice signal corresponding to the fourth implementation. FIG. 18 is a diagram illustrating a spectrum of a dysarthria voice signal according to an embodiment of the present invention. FIG. 19 is a diagram illustrating a spectrum of a reference voice signal of a normal person according to an embodiment of the present invention. FIG. 20 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the first implementation. FIG. 21 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the second implementation. FIG. 22 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the third implementation. FIG. 23 is a diagram illustrating a spectrum of an intelligent voice signal corresponding to the fourth implementation.

From FIG. 12 and FIG. 18, it is known that normal persons have difficulty understanding the dysarthria voice signal. Only after listening to the dysarthria voice signal a few times can normal persons understand it. From FIG. 13 and FIG. 19, it is known that normal persons can understand the reference voice signal. From FIG. 14 and FIG. 20, it is known that the first implementation improves the comprehension of the dysarthria voice signal. However, the intelligent voice signal of the first implementation contains so much noise that it is unpleasant to listen to. As illustrated in FIG. 15 and FIG. 21, the intelligent voice signal of the second implementation has less noise than that of the first implementation. Thus, the second implementation is more pleasant to listen to. From FIG. 16 and FIG. 22, it is known that the intelligent voice signal of the third implementation resembles the reference voice signal. However, normal persons still cannot understand the intelligent voice signal of the third implementation. This is because DTW has a limited ability to align frames, which degrades the voice conversion effect of the WaveRNN. From FIG. 17 and FIG. 23, it is known that the intelligent voice signals of the fourth implementation and the second implementation have a similar hearing effect.

According to the embodiments provided above, the device and the method for clarifying dysarthria voices convert dysarthria features into an intelligent voice signal based on an intelligent voice conversion model without using an inverse Fourier transform and without receiving phases corresponding to the dysarthria features.

The embodiments described above are only to exemplify the present invention but not to limit the scope of the present invention. Therefore, any equivalent modification or variation according to the shapes, structures, features, or spirit disclosed by the present invention is also to be included within the scope of the present invention.

What is claimed is:
 1. A device for clarifying dysarthria voices comprising: a first framing circuit configured to receive and frame a dysarthria voice signal to generate dysarthria frames; a first feature extracter coupled to the first framing circuit and configured to receive the dysarthria frames and extract dysarthria features from the dysarthria frames; and an intelligent voice converter coupled to the first feature extracter and configured to receive the dysarthria features, wherein the intelligent voice converter is configured to convert the dysarthria features into an intelligent voice signal based on an intelligent voice conversion model without receiving phases corresponding to the dysarthria features; wherein the intelligent voice conversion model is not trained based on dynamic time warping (DTW).
 2. The device for clarifying dysarthria voices according to claim 1, wherein the intelligent voice conversion model is trained by an intelligent voice training system, and the intelligent voice training system comprises: a second framing circuit configured to receive and frame a dysarthria corpus corresponding to the dysarthria voice signal to generate dysarthria corpus frames; a second feature extracter coupled to the second framing circuit and configured to receive the dysarthria corpus frames and extract from the dysarthria corpus frames dysarthria corpus features corresponding to the dysarthria features; a feature mapper coupled to the second feature extracter and configured to receive the dysarthria corpus features and convert the dysarthria corpus features into reference corpus features corresponding to the intelligent voice signal based on a feature mapping model; a voice synthesizer coupled to the feature mapper and configured to receive the reference corpus features and convert the reference corpus features into reference corpus frames based on a voice synthesizing model; and an intelligent voice trainer coupled to the second feature extracter and the voice synthesizer and configured to receive the reference corpus frames and the dysarthria corpus features and train the intelligent voice conversion model based on the reference corpus frames and the dysarthria corpus features.
 3. The device for clarifying dysarthria voices according to claim 2, wherein the feature mapping model is trained by a feature mapping training system, and the feature mapping training system comprises: a corpus pre-processing circuit configured to receive, frame, and align the dysarthria corpus and a reference corpus to generate the dysarthria corpus frames and the reference corpus frames, wherein the dysarthria corpus frames and the reference corpus frames are aligned to each other, and the reference corpus corresponds to the intelligent voice signal; a mapping feature extracter coupled to the corpus pre-processing circuit and configured to receive the dysarthria corpus frames and the reference corpus frames and respectively extract the dysarthria corpus features and the reference corpus features from the dysarthria corpus frames and the reference corpus frames; and a feature mapping trainer coupled to the mapping feature extracter and configured to receive the dysarthria corpus features and the reference corpus features and train the feature mapping model based on the dysarthria corpus features and the reference corpus features.
 4. The device for clarifying dysarthria voices according to claim 2, wherein the voice synthesizing model is trained by a voice synthesizing training system, and the voice synthesizing training system comprises: a third framing circuit configured to receive and frame a reference corpus to generate the reference corpus frames, wherein the reference corpus corresponds to the intelligent voice signal; a third feature extracter coupled to the third framing circuit and configured to receive the reference corpus frames and extract the reference corpus features from the reference corpus frames; and a voice synthesizing trainer coupled to the third framing circuit and the third feature extracter and configured to receive the reference corpus frames and the reference corpus features and train the voice synthesizing model based on the reference corpus frames and the reference corpus features.
 5. The device for clarifying dysarthria voices according to claim 1, wherein the intelligent voice conversion model comprises a feature mapping model and a voice synthesizing model, and the intelligent voice converter comprises: a feature mapper coupled to the first feature extracter and configured to receive the dysarthria features and convert the dysarthria features into reference features based on the feature mapping model; and a voice synthesizer coupled to the feature mapper and configured to receive the reference features and convert the reference features into the intelligent voice signal based on the voice synthesizing model.
 6. The device for clarifying dysarthria voices according to claim 5, wherein the feature mapping model is trained by a feature mapping training system, and the feature mapping training system comprises: a corpus pre-processing circuit configured to receive, frame, and align a dysarthria corpus and a reference corpus to generate dysarthria corpus frames and reference corpus frames that are aligned to each other, wherein the dysarthria corpus corresponds to the dysarthria voice signal, and the reference corpus corresponds to the intelligent voice signal; a mapping feature extracter coupled to the corpus pre-processing circuit and configured to receive the dysarthria corpus frames and the reference corpus frames and respectively extract dysarthria corpus features and reference corpus features from the dysarthria corpus frames and the reference corpus frames, wherein the dysarthria corpus features and the reference corpus features respectively correspond to the dysarthria features and the reference features; and a feature mapping trainer coupled to the mapping feature extracter and configured to receive the dysarthria corpus features and the reference corpus features and train the feature mapping model based on the dysarthria corpus features and the reference corpus features.
 7. The device for clarifying dysarthria voices according to claim 5, wherein the voice synthesizing model is trained by a voice synthesizing training system, and the voice synthesizing training system comprises: a second framing circuit configured to receive and frame a reference corpus to generate reference corpus frames, wherein the reference corpus corresponds to the intelligent voice signal; a second feature extracter coupled to the second framing circuit and configured to receive the reference corpus frames and extract reference corpus features corresponding to the reference features from the reference corpus frames; and a voice synthesizing trainer coupled to the second framing circuit and the second feature extracter and configured to receive the reference corpus frames and the reference corpus features and train the voice synthesizing model based on the reference corpus frames and the reference corpus features.
 8. The device for clarifying dysarthria voices according to claim 1, wherein the dysarthria features comprise at least one of a log power spectrum (LPS), a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity, and the intelligent voice conversion model comprises a WaveNet or a Wave recurrent neural network (RNN).
 9. The device for clarifying dysarthria voices according to claim 1, wherein the dysarthria features comprise log power spectrums, and the intelligent voice converter is configured to convert the dysarthria features into the intelligent voice signal using an inverse Fourier transform.
 10. The device for clarifying dysarthria voices according to claim 1, wherein the dysarthria features comprise a log power spectrum (LPS), a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity, and the intelligent voice converter is a vocoder.
 11. A method for clarifying dysarthria voices comprising: receiving and framing a dysarthria voice signal to generate dysarthria frames; receiving the dysarthria frames and extracting dysarthria features from the dysarthria frames; and receiving the dysarthria features and converting the dysarthria features into an intelligent voice signal based on an intelligent voice conversion model without receiving phases corresponding to the dysarthria features; wherein the intelligent voice conversion model is not trained based on dynamic time warping (DTW).
 12. The method for clarifying dysarthria voices according to claim 11, wherein a method for training the intelligent voice conversion model comprises: receiving and framing a dysarthria corpus corresponding to the dysarthria voice signal to generate dysarthria corpus frames; receiving the dysarthria corpus frames and extracting from the dysarthria corpus frames dysarthria corpus features corresponding to the dysarthria features; receiving the dysarthria corpus features and converting the dysarthria corpus features into reference corpus features corresponding to the intelligent voice signal based on a feature mapping model; receiving the reference corpus features and converting the reference corpus features into reference corpus frames based on a voice synthesizing model; and receiving the reference corpus frames and the dysarthria corpus features and training the intelligent voice conversion model based on the reference corpus frames and the dysarthria corpus features.
 13. The method for clarifying dysarthria voices according to claim 12, wherein a method for training the feature mapping model comprises: receiving, framing, and aligning the dysarthria corpus and a reference corpus to generate the dysarthria corpus frames and the reference corpus frames, wherein the dysarthria corpus frames and the reference corpus frames are aligned to each other, and the reference corpus corresponds to the intelligent voice signal; receiving the dysarthria corpus frames and the reference corpus frames and respectively extracting the dysarthria corpus features and the reference corpus features from the dysarthria corpus frames and the reference corpus frames; and receiving the dysarthria corpus features and the reference corpus features and training the feature mapping model based on the dysarthria corpus features and the reference corpus features.
 14. The method for clarifying dysarthria voices according to claim 12, wherein a method for training the voice synthesizing model comprises: receiving and framing a reference corpus to generate the reference corpus frames, wherein the reference corpus corresponds to the intelligent voice signal; receiving the reference corpus frames and extracting the reference corpus features from the reference corpus frames; and receiving the reference corpus frames and the reference corpus features and training the voice synthesizing model based on the reference corpus frames and the reference corpus features.
 15. The method for clarifying dysarthria voices according to claim 11, wherein the intelligent voice conversion model comprises a feature mapping model and a voice synthesizing model, and the step of receiving the dysarthria features and converting the dysarthria features into the intelligent voice signal based on the intelligent voice conversion model without receiving the phases comprises: receiving the dysarthria features and converting the dysarthria features into reference features based on the feature mapping model; and receiving the reference features and converting the reference features into the intelligent voice signal based on the voice synthesizing model.
 16. The method for clarifying dysarthria voices according to claim 15, wherein a method for training the feature mapping model comprises: receiving, framing, and aligning a dysarthria corpus and a reference corpus to generate dysarthria corpus frames and reference corpus frames that are aligned to each other, wherein the dysarthria corpus corresponds to the dysarthria voice signal, and the reference corpus corresponds to the intelligent voice signal; receiving the dysarthria corpus frames and the reference corpus frames and respectively extracting dysarthria corpus features and reference corpus features from the dysarthria corpus frames and the reference corpus frames, wherein the dysarthria corpus features and the reference corpus features respectively correspond to the dysarthria features and the reference features; and receiving the dysarthria corpus features and the reference corpus features and training the feature mapping model based on the dysarthria corpus features and the reference corpus features.
 17. The method for clarifying dysarthria voices according to claim 15, wherein a method for training the voice synthesizing model comprises: receiving and framing a reference corpus to generate reference corpus frames, wherein the reference corpus corresponds to the intelligent voice signal; receiving the reference corpus frames and extracting reference corpus features corresponding to the reference corpus from the reference corpus frames; and receiving the reference corpus frames and the reference corpus features and training the voice synthesizing model based on the reference corpus frames and the reference corpus features.
 18. The method for clarifying dysarthria voices according to claim 11, wherein the dysarthria features comprise at least one of a log power spectrum (LPS), a Mel spectrum, a fundamental frequency, a Mel-frequency cepstral coefficient, and an aperiodicity, and the intelligent voice conversion model comprises a WaveNet or a Wave recurrent neural network (RNN).